[00:01.550 --> 00:11.010]  So, welcome to our next talk by Nahid Ghali and Vincent Pham, who I believe are both researchers
[00:11.010 --> 00:17.710]  at Capital One, and they're presenting a detection system for ML-based threats.
[00:18.890 --> 00:22.070]  So let's let them take it away.
[00:23.570 --> 00:28.570]  Welcome to our talk, Spectrum, an end-to-end framework for ML-based threat monitoring and detection.
[00:28.570 --> 00:35.730]  I'm Vincent Pham, and my co-author, Nahid Farhadi. We're both software developers at
[00:35.730 --> 00:44.270]  Capital One. We're going to talk about a platform that targets insider threat. So for those
[00:44.270 --> 00:49.550]  of you who are unfamiliar with what insider threat is, we have the formal definition here.
[00:49.550 --> 00:54.010]  So insider threat is the potential for an individual who has or had authorized access
[00:54.010 --> 00:59.390]  to an organization's asset to use that access, either maliciously or unintentionally, to
[00:59.390 --> 01:04.390]  act in a way that could negatively affect the organization. This unintentionally is
[01:04.390 --> 01:09.450]  also important because we're not looking at only malicious or suspicious people. Also
[01:09.450 --> 01:14.670]  people who have naively or unknowingly leaked data from the company as well, since they
[01:14.670 --> 01:20.730]  are a potential threat. I think this comic on the right side pretty much clarifies who
[01:20.730 --> 01:26.470]  are insider threats. Basically when you think of the entire compass of who are possible
[01:26.470 --> 01:30.990]  threats in the world, it's pretty much the entire population and then who are insider
[01:30.990 --> 01:37.570]  threats. That's the people who work within a company. Even if you have zero risk or very
[01:37.570 --> 01:46.330]  minimal risk, there's still a potential threat there. And just to give you a little bit about
[01:46.930 --> 01:53.850]  the insider threat stats, 90% of organizations feel vulnerable to insider attacks. 53% of
[01:53.850 --> 01:59.410]  companies had an insider threat attack against their organization in the last 12 months. 37%
[01:59.410 --> 02:05.330]  of employees have excessive access privileges. This could be important because think about
[02:05.830 --> 02:11.650]  your most important data, what would happen if that left your company, and who has access to that
[02:11.650 --> 02:18.010]  data. 94% of companies monitor their employees' digital footprint. So these are the data sources
[02:18.010 --> 02:23.870]  that we could use to track where the data is going, who has access to what. And then 86%
[02:23.870 --> 02:30.050]  of companies have or are building an insider threat program to date. And then also another
[02:30.050 --> 02:37.590]  important fact is that the average cost of insider threats is currently estimated at about $1.6
[02:37.590 --> 02:48.510]  million. So we mentioned that 86% of companies are thinking about or are currently
[02:48.510 --> 02:53.830]  building an insider threat system. You might be thinking, should you buy a platform that
[02:53.830 --> 02:59.830]  already has a solution developed, or should you build your own solution to insider threat?
[03:00.210 --> 03:05.990]  We list some of the pros of buying versus building. So if you think about buying
[03:05.990 --> 03:12.810]  a platform, the pro of that is that a solution is already developed. So it could possibly be
[03:12.810 --> 03:19.370]  as simple as plug and play. Although in most cases, it'll take a couple of months or maybe
[03:19.370 --> 03:24.870]  even a couple of years just to integrate that platform into your system. Also support is
[03:24.870 --> 03:30.310]  provided so you don't have to build up your own team, you could pay for support to install that
[03:30.310 --> 03:37.590]  system and give you the insider threat analysis. Lastly, the data management and handling
[03:37.590 --> 03:41.450]  solution is also provided by the company as well. So once again, you don't have to build your own
[03:41.450 --> 03:46.930]  team. The pros of building your own system is that technology is always changing. So if your
[03:46.930 --> 03:51.970]  company is thinking about changing its technology, changing its database, or just changing the way
[03:52.590 --> 03:58.530]  everything is handled, moving from maybe a server or a local server to the cloud,
[03:58.530 --> 04:05.050]  you could quickly adapt to that need. If your system is constantly being audited, you have
[04:05.050 --> 04:09.530]  full knowledge of the systems, you don't have to wait for your vendor to give you back some
[04:09.530 --> 04:15.110]  information, or maybe they're not even going to release the information. When you're building
[04:15.110 --> 04:21.210]  it yourself, you already have that information. And then once again, due to ever-changing technology,
[04:21.210 --> 04:26.710]  you might have changing data as well. So then you could adapt your platform really quickly to
[04:26.710 --> 04:38.910]  that changing data. So in order to combat insider threat, we built a system called Spectrum,
[04:38.910 --> 04:44.710]  which operates on two types of entities and the relationships between them. The entities are the subjects and
[04:44.710 --> 04:53.810]  the objects. The subjects are the entities that take action, which, in a very
[04:53.810 --> 04:58.530]  simple sense for insider threat, are the employees. But you could also consider
[04:59.330 --> 05:06.890]  IAM roles in AWS, which is like a group of profiles gathered together. Or if you're doing insider
[05:06.890 --> 05:12.530]  threats, or I guess external threats, you can also think about customers as your subjects.
[05:12.530 --> 05:19.190]  And then the objects are the entities on which actions are taken. And those actions are recorded as
[05:19.190 --> 05:23.590]  event trails. So you could think of like anytime someone logs into a VPN, what time they logged
[05:23.590 --> 05:31.490]  in at, if they badged into a building, if they opened a laptop, events such as that are recordable
[05:31.490 --> 05:44.140]  and provide evidence of whether an insider threat is happening or not. So the proposed framework
[05:44.920 --> 05:52.100]  for Spectrum follows such that if we go from the very top, we define a set of entities. So
[05:52.540 --> 05:58.360]  once again, in this case, that would be the employees, we find out the different data
[05:58.360 --> 06:06.420]  sources that are linked to the entity. So we might have, in this case, three different event sources.
[06:06.780 --> 06:12.840]  And then for each of the individual event sources, we have features for that event source.
[06:12.840 --> 06:19.300]  For example, let's say you're using your email data set, that email data set will have features
[06:19.300 --> 06:25.240]  such as where the email is being sent to, where it's being sent from, the time of day the email
[06:25.240 --> 06:31.400]  is sent, the subject, and so on and so on. And then from the features, we can construct patterns.
[06:32.520 --> 06:40.040]  For example, has the email been sent within the past 24 hours? Does the content of the email
[06:40.500 --> 06:48.520]  display a positive or negative emotion? And so on. We could construct rules from it,
[06:48.520 --> 06:53.160]  we could also do anomaly detection on the feature itself. So is there something suspicious
[06:53.160 --> 06:58.700]  that deviates from the baseline? And then we can also build machine learning models such as classifiers
[06:58.700 --> 07:05.540]  to identify, once again, whether particular events or sets of features show up as something
[07:05.540 --> 07:12.200]  that's suspicious or not. So once we construct all of these rules, anomaly detectors, and classifiers,
[07:12.780 --> 07:16.740]  we tie the dependencies back to the extractors, which tie back to the event sources.
[07:16.740 --> 07:22.840]  And then that's how we built the entire pipeline from the bottom up. And then when we run the
[07:22.840 --> 07:30.120]  pipeline, it then reconstructs this entire scheme and ties back all the data sources so that if you
[07:30.120 --> 07:34.980]  turn off one pattern, then it might turn off a particular source just to save some cost in the
[07:34.980 --> 07:44.840]  meantime. Our data is then enriched at the row level, per entity, and that is stored into storage with
[07:44.840 --> 07:52.920]  the entity and the enriched data of whether a particular row is suspicious or not, or whether,
[07:53.620 --> 07:59.020]  I guess, the classifier is marking it with a zero-to-one probability of suspiciousness.
[07:59.500 --> 08:07.060]  This information is then used to flag a particular entity, maybe another reduced flag,
[08:07.060 --> 08:12.420]  or maybe open up a case in case management so that analysts can look into it, or it could even be
[08:12.420 --> 08:18.760]  displayed in a UI so that analysts can go through this ranked order of suspicious entities, suspicious
[08:18.760 --> 08:23.480]  events, and see whether this is really a true positive or false positive in terms of
[08:23.480 --> 08:30.280]  insider threat or not. So, in a nutshell, this is what our Spectrum system looks like.
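The idea of tying patterns back to extractors and event sources, so that turning off a pattern can turn off a source, could be sketched roughly as follows. This is only an illustration under assumptions: the registry layout and every name in it (`PATTERN_DEPS`, `EXTRACTOR_SOURCES`, `required_sources`, the pattern and extractor names) are hypothetical and do not come from the talk.

```python
# Hypothetical registries; the names are illustrative, not Spectrum's own.
# Each pattern (rule, anomaly detector, or classifier) lists the extractors it needs.
PATTERN_DEPS = {
    "email_sent_last_24h": ["email_extractor"],
    "large_print_job": ["print_extractor"],
    "proxy_upload_anomaly": ["proxy_extractor"],
}

# Each extractor reads from exactly one event source in this toy setup.
EXTRACTOR_SOURCES = {
    "email_extractor": "email",
    "print_extractor": "print",
    "proxy_extractor": "proxy",
}


def required_sources(enabled_patterns):
    """Walk pattern -> extractor -> event source and collect the sources needed."""
    sources = set()
    for pattern in enabled_patterns:
        for extractor in PATTERN_DEPS[pattern]:
            sources.add(EXTRACTOR_SOURCES[extractor])
    return sources


# With every pattern enabled, all three sources have to be read.
all_sources = required_sources(PATTERN_DEPS)

# With the print pattern turned off, the print source is no longer needed.
without_print = required_sources(["email_sent_last_24h", "proxy_upload_anomaly"])
```

Resolving sources from the enabled patterns, rather than listing sources by hand, is what lets a disabled pattern save the cost of reading a source nobody uses.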
[08:32.100 --> 08:41.400]  All right. So, this proposed framework had some techniques applied to it in order to effectively
[08:43.620 --> 08:50.480]  save money on the cost of data handling and to make sure that we don't lose any data,
[08:50.660 --> 08:56.600]  using modes of operation and also dependency orchestration. I'm going to go through some of
[08:56.600 --> 09:01.820]  the techniques that we use to build this whole framework. The first one is that we have
[09:02.340 --> 09:08.860]  a unified source of the data. Basically, when we look at different sources of information,
[09:08.860 --> 09:13.520]  like digital footprint, it could be any digital interaction that the employee has
[09:13.520 --> 09:18.620]  with the outside world. It could be proxy monitoring, mail servers, different sensors
[09:18.620 --> 09:27.240]  that we have on endpoints or user machines or on cloud accounts that monitor all of these
[09:27.980 --> 09:34.580]  traffic and the communication that we have with the outside world. We can have different
[09:34.580 --> 09:41.940]  tools for event monitoring like Windows Event Log, Sysmon, Snort, etc. And each of these resources
[09:41.940 --> 09:49.240]  provides a specific type of information for us. But the important thing is that when we want to
[09:49.240 --> 09:55.200]  run the job and we want to detect whether there have been any insider threat actions over the
[09:55.200 --> 10:03.860]  past 24 hours, it is important to only query the past 24 hours for that specific data.
[10:03.860 --> 10:09.680]  Now, at development time, your developers are going to develop different features, and
[10:09.680 --> 10:16.840]  each of them is going to query against all of these data sources, which is very costly.
[10:16.840 --> 10:26.180]  So our proposed solution for cost efficiency in the Spectrum development phase is that
[10:26.820 --> 10:34.060]  we took all of these data sources, processed them, sessionized them, and
[10:34.060 --> 10:41.020]  extracted the information that we really need from each data source into a specific table. And then
[10:41.020 --> 10:48.340]  we had a Matillion job running on a regular schedule, like five times a day, six times a day, as your
[10:48.340 --> 10:55.840]  customer requests, to extract that information, process it, and keep it in a table for us.
[10:55.840 --> 11:03.100]  The size of this table is an order of magnitude smaller compared to the size of the data sources
[11:03.100 --> 11:09.100]  that you originally have. Plus, your developers wouldn't each go and query
[11:09.100 --> 11:15.000]  those huge tables by themselves. Now you have unified all of this information in one table, and
[11:15.000 --> 11:19.320]  every time you want to develop, you want to test, you want to add a new feature,
[11:19.320 --> 11:26.460]  you are going to query the smaller table. So this actually saved us 90 percent of the cost
[11:26.460 --> 11:35.420]  of querying. We use Snowflake as the source of the data where we keep our original information
[11:35.420 --> 11:42.660]  and then we use Matillion in order to have a schedule for running jobs on extracting information
[11:42.660 --> 11:50.940]  and processing them from Snowflake. So here is an example of how we keep the data.
[11:51.640 --> 11:58.780]  The other benefit or other feature that we propose in the implementation of Spectrum is that
[11:58.780 --> 12:05.280]  we unify the data. This brings a second benefit for us. First one was that, okay,
[12:05.280 --> 12:15.180]  we had some cost savings. The second benefit is that this data frame is like an object and it is
[12:15.180 --> 12:23.400]  moving along the pipeline and different actions are being executed on this data frame. The
[12:23.400 --> 12:30.320]  important thing is that we don't complicate our design by separating different data sources in
[12:30.320 --> 12:36.520]  different tables, with each of those tables having a different number of columns. For example,
[12:36.520 --> 12:43.600]  in email, you have a data type, definitely a hash or a unique ID. It is related to an entity ID,
[12:43.600 --> 12:50.120]  it could be an employee, it could be an IAM role or whatever entity that is doing the action,
[12:50.120 --> 12:56.860]  and it has a timestamp. But in meta, you have to, from, number of attachments, etc.
[12:56.860 --> 13:03.920]  The same information for a proxy data is very different. Of course, you have the first four
[13:03.920 --> 13:11.080]  columns, but in meta, you have probably the URL that has been accessed, the type of connection,
[13:11.080 --> 13:17.920]  is it HTTP or what type of connection is it, is it blocked by the company or not,
[13:17.920 --> 13:25.740]  etc., other types of information. So imagine if we had different data frames for each of these
[13:25.740 --> 13:33.540]  data types. And you know, you can have up to 20 data sources and having different tables for each
[13:33.540 --> 13:40.260]  of these is going to make your design very complicated and very inflexible against making
[13:40.260 --> 13:47.960]  changes. But with the proposed method, we keep information related to each data type in a JSON
[13:47.960 --> 13:57.040]  format in just one column, and that column is named meta. This way, our data frame can have
[13:57.040 --> 14:03.120]  different types of information, but all of them are in the same format, all of them are in one
[14:03.120 --> 14:09.120]  data frame, and that data frame is going through the pipeline, and each of the modules is responsible
[14:09.120 --> 14:16.600]  for doing specific actions on it. That's why we had extractors. Those extractors actually extract
[14:16.600 --> 14:20.740]  features and stuff from the meta field.
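A minimal sketch of this unified layout, assuming plain Python dictionaries in place of the real data frame: every event shares the same four columns, and everything specific to the source sits in a JSON string in `meta`. The column names, sample values, and the `extract_email_features` helper are hypothetical, chosen only to mirror the email and proxy examples from the talk.

```python
import json

# Every event has the same shared columns; source-specific fields go into "meta" as JSON.
events = [
    {
        "event_type": "email",
        "event_id": "a1",
        "entity_id": "emp-1",
        "timestamp": "2020-01-01T09:00:00",
        "meta": json.dumps(
            {"to": "x@example.com", "from": "emp-1@example.com", "num_attachments": 2}
        ),
    },
    {
        "event_type": "proxy",
        "event_id": "b2",
        "entity_id": "emp-1",
        "timestamp": "2020-01-01T09:05:00",
        "meta": json.dumps({"url": "https://example.com/upload", "blocked": False}),
    },
]


def extract_email_features(event):
    """Hypothetical extractor: parse the meta field of email events only."""
    if event["event_type"] != "email":
        return None
    meta = json.loads(event["meta"])
    return {
        "entity_id": event["entity_id"],
        "num_attachments": meta["num_attachments"],
    }


# Non-email events are skipped, so one table can carry every event type.
email_features = [
    features
    for features in (extract_email_features(event) for event in events)
    if features is not None
]
```

Adding a new data source in this layout means adding a new extractor, not a new table schema, which is the flexibility the single `meta` column is meant to buy.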
[14:23.600 --> 14:30.500]  Another piece of design thinking behind this proposed framework is modes of operation.
[14:31.060 --> 14:38.980]  So as you know, like any other machine learning system, we have a normal job that is running,
[14:38.980 --> 14:48.360]  usually in the POC phase, we call it the test one, which is actually running the trained model
[14:48.360 --> 14:55.500]  against your real-time data and extracting results. We call that mode modeling, and we
[14:55.500 --> 15:03.360]  run it six times a day, or more per user's request. We have a training phase. In that phase,
[15:03.360 --> 15:09.740]  we only train the models. We have several baseliners, we have several classifiers.
[15:09.740 --> 15:14.980]  Those baseliners and anomaly detectors are for user and entity behavior monitoring.
[15:14.980 --> 15:21.140]  So we compare employees against their own behavior, or against their peers' behavior,
[15:21.140 --> 15:26.820]  or role-based type of anomaly detection. So another phase of operation is training.
[15:26.820 --> 15:32.920]  In training, obviously, we extract more or longer period of the data.
[15:33.920 --> 15:40.580]  And that's actually where the Matillion job and having a unified table helps us.
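One way to picture the modes of operation is as a lookback window per mode over the same unified table. The sketch below is an assumption-laden illustration: the 24-hour modeling window and the 90-day training window are the figures given in the talk, while the 7-day backfill window, the `LOOKBACK` table, and the `query_window` helper are hypothetical (the talk only says that backfill, covered a little later, runs over "a longer period of time").

```python
from datetime import datetime, timedelta

# Lookback per mode. Only the first two values come from the talk.
LOOKBACK = {
    "modeling": timedelta(hours=24),  # scoring run on recent events
    "training": timedelta(days=90),   # the talk's example training period
    "backfill": timedelta(days=7),    # assumed value, not stated in the talk
}


def query_window(mode, now):
    """Return the (start, end) time range a job in the given mode should read."""
    return now - LOOKBACK[mode], now


now = datetime(2020, 1, 31, 12, 0)
modeling_start, modeling_end = query_window("modeling", now)
training_start, training_end = query_window("training", now)
```

Keeping the window as the only thing that differs between modes means the same extraction and detection code can serve the frequent short runs and the occasional long ones.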
[15:40.580 --> 15:46.500]  In modeling, we only look at the past 24 hours of the data. In training, we look at, for example,
[15:46.500 --> 15:55.840]  90 days of the data, or a longer period of time. So in this case, now that we have two different
[15:55.840 --> 16:03.420]  phases, and we separated these duties from each other, we make sure not to run really long queries
[16:03.420 --> 16:09.540]  for training. We have performance monitoring. In this performance monitoring, we monitor the
[16:09.540 --> 16:16.120]  thresholds and metrics for our machine learning models, for classifiers and anomaly detectors,
[16:16.120 --> 16:24.840]  and make sure that they are acting normally, that they are not having any problems with the sources of
[16:24.840 --> 16:32.920]  the data or with the baseliners. And if we generate alerts off of these, we go back to
[16:32.920 --> 16:39.660]  training and we fine-tune our model. We also have a backfill mode. Something that happens
[16:39.660 --> 16:45.240]  usually in production is that sometimes your sensors don't work, sometimes you don't have the
[16:45.240 --> 16:53.120]  source of the data available, or there are problems with running the jobs that fill out those
[16:53.120 --> 17:01.880]  tables. So in backfill mode, we run the same modeling, but this time on a longer period of
[17:01.880 --> 17:08.980]  time to make sure that we haven't missed any of the events due to problems in production in
[17:08.980 --> 17:18.960]  providing the data. Another important part of this modeling job is a dependency orchestrator.
[17:18.960 --> 17:25.340]  As I said before, we have different, and Vincent mentioned before, we have different phases of
[17:26.680 --> 17:34.940]  operation. First, we extract stuff from our data sources, then after we extract those data,
[17:34.940 --> 17:43.800]  we have three main techniques in order to find out if an event is malicious or not. We have
[17:43.800 --> 17:49.500]  rule-based techniques, which we call pattern matchers, we have baseliners, which is anomaly
[17:49.500 --> 17:56.180]  detection type of job, and then we have the machine learning component. Now, each of these might
[17:56.180 --> 18:03.680]  depend on each other. For example, on a baseliner, you might want to run the baseliner on a feature
[18:03.680 --> 18:10.580]  that is extracted, or you want to run a baseliner on a specific data type, for example, on print
[18:10.580 --> 18:17.260]  operations. That doesn't necessarily need to wait on extracting information from email,
[18:17.260 --> 18:23.840]  right? It's just a baseliner on print operation, so there is no dependency between email events
[18:23.840 --> 18:30.680]  and print events. So, in the dependency orchestrator, we come up with a plan for
[18:30.680 --> 18:37.840]  the most optimized way to run these modules based on their dependencies. You may
[18:37.840 --> 18:43.560]  have machine learning components that depend on several baseliners or several pattern matchers,
[18:43.560 --> 18:49.440]  you may run different baseliners that depend on two features, and finally, you may detect the
[18:49.440 --> 18:58.820]  maliciousness of something if a specific feature is extracted and it is violating an anomaly. So,
[18:58.820 --> 19:05.420]  there are all these dependencies. Here is an example of module dependencies. For example,
[19:05.420 --> 19:12.420]  here, we read the data from our original database and imagine that each of these colors is an event
[19:12.420 --> 19:18.780]  type. For example, this is email, this is proxy, and this is print. We extract information from email,
[19:18.780 --> 19:23.440]  we extract information from print, for example, number of pages, printer that is used, and we
[19:23.440 --> 19:32.540]  extract information from proxy. Then, we may want to run a baseliner on the email events and look at the
[19:32.540 --> 19:38.420]  number of emails that someone is sending on a daily basis, or we may want to look at
[19:38.420 --> 19:43.580]  the amount of upload and download for proxy that we
[19:43.580 --> 19:52.000]  use in a baseliner. Then, later on, we have what we call detectors. These detectors might say that I
[19:52.000 --> 19:58.840]  think this person is malicious if they printed something and if they had some sort of anomalous
[19:58.840 --> 20:06.000]  behavior in terms of proxy. So, the dependency of this detector is both on the baseliner and on the
[20:06.000 --> 20:14.100]  extractor of the second event type. So, basically, the job of dependency orchestrator is to come up
[20:14.100 --> 20:21.240]  with a plan and find out what is the dependency for each of these modules and make sure that they
[20:21.240 --> 20:27.740]  are ordered in a way that each module meets its dependency when it's time for the module to be
[20:27.740 --> 20:33.920]  executed. Finally, we have a store operation where we store all of those results in a Postgres
[20:34.680 --> 20:42.780]  database or any other type of database and we have a list of threats that we have detected for
[20:42.780 --> 20:52.000]  each of the entities. Another capability of this implementation of the proposed
[20:52.000 --> 21:00.140]  framework is that it provides us scalability. First of all, for the processing engine,
[21:00.140 --> 21:09.640]  we use Spark on AWS EMR, and it is composed of instance fleets, with some spot instances and
[21:09.640 --> 21:16.400]  some on-demand instances on EC2. As the number of jobs increases, or the amount of data,
[21:16.400 --> 21:20.800]  the number of events that we are monitoring, or the period that we are monitoring,
[21:21.120 --> 21:29.700]  it can scale out and perform the job successfully. Scalability from another
[21:29.700 --> 21:38.340]  point of view is that we have added RBAC-restricted views for our events. For example,
[21:38.340 --> 21:45.640]  if you have different customers with different levels of access control to the entities that
[21:45.640 --> 21:52.680]  you are monitoring or to the events that you are monitoring, you can use RBAC, role-based
[21:52.680 --> 22:03.320]  access control in order to make sure that, for example, the head of insider threat monitoring
[22:03.320 --> 22:10.560]  can have access to all of the threats, can see all of the activities of, for example,
[22:10.560 --> 22:18.020]  executive or leadership of the company, but maybe a junior security analyst doesn't have
[22:18.020 --> 22:27.980]  as much capability. So in our detectors, we added a field where we set a level of access
[22:28.620 --> 22:34.880]  for each of those threats, and it can be based on role, based on a threat, or based on the
[22:34.880 --> 22:42.680]  activity that has been executed. This will give us the capability to provide service for multiple
[22:42.680 --> 22:50.620]  customers as well. The same framework, looking at the exact same data,
[22:50.620 --> 22:57.200]  can be used for insider threat detection, but if we have other customers, we can see
[22:57.200 --> 23:03.480]  if there was any network intrusion, which is basically the CSOC's interest, if there is any website
[23:03.480 --> 23:10.060]  abuse, or in case of financial companies, we can see if there is any insider trading
[23:10.680 --> 23:16.440]  or any fraud done by the employees. This doesn't necessarily have to be looking at specific
[23:17.860 --> 23:25.520]  entities or limited to any relationship between those entities. As long as the application is
[23:25.520 --> 23:30.880]  monitoring and looking at anomaly detection, we can simply change the source of the data
[23:30.880 --> 23:35.860]  and modify the dependencies and actually extract the results.
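The dependency orchestrator described earlier in the talk, and the idea that a new use case only needs the sources and dependencies changed, can be illustrated with a topological sort. This is a sketch under assumptions, not the talk's implementation: it uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+), and the module names in `deps` are hypothetical, modeled on the email, print, and proxy example.

```python
from graphlib import TopologicalSorter

# Each module maps to the set of modules that must finish before it runs.
# Names are illustrative; they follow the email / print / proxy example.
deps = {
    "extract_email": {"read_events"},
    "extract_print": {"read_events"},
    "extract_proxy": {"read_events"},
    "baseline_email_volume": {"extract_email"},
    "baseline_proxy_traffic": {"extract_proxy"},
    # The detector needs the print extractor and the proxy baseliner.
    "detect_print_and_proxy_anomaly": {"extract_print", "baseline_proxy_traffic"},
    "store_results": {"detect_print_and_proxy_anomaly", "baseline_email_volume"},
}

# static_order() yields a valid execution order: every module appears
# only after all of its dependencies.
plan = list(TopologicalSorter(deps).static_order())

for step, module in enumerate(plan, start=1):
    print(step, module)
```

In this picture, pointing the framework at a different problem such as network intrusion would mean editing `deps` and the modules behind it, while the ordering logic stays the same.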
[23:37.380 --> 23:41.760]  Thank you so much for your time and attention.
[23:42.080 --> 23:46.060]  Here are our emails. We're happy to answer any questions.
